Mastering Web Scraping for Data Collection

MethodsNET Workshop

Marine Bardou, Aurélien Goutsmedt, and Thomas Laloux (UC Louvain)

2024-10-31

1 Training Goals

Motivations

First Session:

  • Giving a basic understanding of what web scraping is and what it can do
  • Discussing ethical (and legal) issues linked to web scraping
  • Proposing a roadmap for practising web scraping in R
  • Providing bits of code and practical tips


Second Session:

  • Hands-on practice with different exercises by level of difficulty

Prerequisites

  • R and RStudio are needed for the second session
  • These slides are built from a .qmd (quarto) document \(\Rightarrow\) all the code used in these slides can be run in RStudio
# These lines of code have to be run first if you want to install all the packages directly

# pacman will be used to install (if necessary) and load packages
# We install pacman if it is not already installed
if(length(grep("pacman", installed.packages())) == 0) install.packages("pacman")
library(pacman)

# Installing the needed packages in advance
p_load(tidyverse, # basic suite of packages
       glue, # useful for building string (notably for url)
       scico, # color palettes
       patchwork, # for juxtaposition of graphs
       DT) # to display html tables

2 What is Web Scraping?

What is web scraping?

  • Web scraping is a method for extracting data available on the World Wide Web
  • The World Wide Web, or “Web”, is a network of websites (online documents coded in HTML and CSS)
  • A web scraper is a program, for instance in R, that automatically reads the HTML structure of a website and extracts the relevant content (text, hypertext references, tables)
    • No need to fully understand HTML and CSS
  • Useful when there are many pages to scrape
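To make this concrete, here is a minimal sketch of a scraper run on an inline HTML snippet (the snippet is invented for illustration, so no website is needed):

```r
library(rvest)

# A toy HTML page standing in for a real website (invented for illustration)
html <- minimal_html('
  <h1>Speeches</h1>
  <ul>
    <li><a href="/speech1.htm">First speech</a></li>
    <li><a href="/speech2.htm">Second speech</a></li>
  </ul>')

# Extract the text of the links and their hypertext references
html %>% html_elements("a") %>% html_text2()
html %>% html_elements("a") %>% html_attr("href")
```

The same functions are all you need once `read_html()` has fetched a real page instead of this inline string.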

What is HTML and CSS?

API vs. web scraping

  • API (Application Programming Interface) provides a structured and predictable way to retrieve data from a service. It’s like ordering from a menu; you request specific data and receive it in a structured format
  • Web Scraping is the process of programmatically extracting data from the web page’s HTML itself. It’s akin to manually copying information from a book; you decide what information you need and how to extract it

API vs. web scraping

  • Control and Structure: APIs offer structured access to data, whereas web scraping requires parsing HTML and often cleaning the data yourself.
  • Ease of Use: Using an API can be simpler since it is designed for data access (though not always). Scraping requires dealing with HTML changes and is more prone to breaking.
  • Availability: Not all websites offer an API, making web scraping a necessity in some cases.
  • Limitations and Authorization: APIs often have rate limits and may require authentication, but they provide approved access to the data. Web scraping can bypass these limits but might violate terms of service.
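The contrast can be sketched offline, comparing a JSON string such as an API would return with the same information embedded in HTML markup (both snippets are invented for illustration):

```r
library(jsonlite)
library(rvest)

# What an API might return: structured JSON, ready to use
api_response <- '[{"title": "Dot plots for the Eurosystem?", "date": "2024-10-22"}]'
speeches <- fromJSON(api_response) # directly a clean data frame

# What a web page offers: the same information buried in HTML markup
html <- minimal_html('<p><span class="item_date">22 Oct 2024</span>
  <a class="dark" href="/review/r241022f.htm">Dot plots for the Eurosystem?</a></p>')
html %>% html_element(".item_date") %>% html_text2()
```

With the API you get `speeches$title` immediately; with scraping you first have to locate the right elements and clean the extracted strings yourself.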

Small data everywhere

  • A wide variety of data can be collected:
    • official documents/speeches
    • agendas and meetings
    • lists of personnel or experts in commissions
    • laws or negotiations
  • We can do it “historically” through the Internet Archive

Building databases

Involves a series of questions:

  • What’s your research question and which data would be appropriate to answer it?
  • How much data to collect?
    • Trade-off between collecting a lot of information (which requires more time) and risking missing some information at a later step
  • How to scrape the data? In which format?
    • Trade-off between extracting the data properly in a first step and cleaning it in a second step
  • What do you lose by doing it automatically rather than manually?
  • How to analyse/understand my new data?
  • How to update my database?

3 The Ethics of Web Scraping

Ethical considerations

  • Legal Considerations: Not all data is free to scrape. Websites’ terms of service may explicitly forbid web scraping, and in some jurisdictions, scraping can have legal implications
    • What is “forbidden” by a website is not necessarily “illegal”
  • Privacy Concerns: Scraping personal data can raise significant privacy issues and may be subject to regulations like GDPR in Europe
  • Website Performance: Scraping, especially if aggressive (e.g., making too many requests in a short period), can negatively impact the performance of a website, affecting its usability for others

Questions at stake (Krotov, Johnson, and Silva 2020)


Ethical practices

  • Respect robots.txt: This file on websites indicates which parts should not be scraped
  • Rate Limiting: Making requests at a reasonable rate to avoid overloading the website’s server
  • User-Agent String: Identifying your scraper can help website owners understand the nature of the traffic
  • Data Use: Consider the ethical implications of how scraped data is used. Ensure it respects the privacy and rights of individuals
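Two of these practices can be sketched in a few lines, here assuming the httr package; the URLs and the loop body are invented placeholders (the polite package, used later in these slides, handles both automatically):

```r
library(httr)

# Identify your scraper so website owners understand the nature of the traffic
ua <- user_agent("R scraper for academic training (your.email@example.org)")

my_urls <- c("https://example.org/a", "https://example.org/b") # invented URLs
for (url in my_urls) {
  # response <- GET(url, ua) # the actual request would go here
  Sys.sleep(2) # rate limiting: pause between requests
}
```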

4 How to scrape a website?

The useful packages in R

  • rvest: navigating websites, scraping and parsing HTML code
  • polite: responsible web etiquette (informing the website that you are scraping)
  • RSelenium: using a bot to interact with a website
p_load(rvest, # scraping and manipulating html pages
       polite, # scraping ethically
       RSelenium) # scraping by interacting with RSelenium

The Role of Sitemaps

  • Sitemap: informs search engines about the URLs on a website that are available for web crawling
    • Understand the structure of a website
    • Find where the information we want to extract is located

Being respectful of the website

Declaring yourself:

bis_website_path <- "https://www.bis.org" # the BIS website, scraped throughout these slides
session <- polite::bow(bis_website_path, 
                       user_agent = "polite R package - used for academic training by 
                       Aurélien Goutsmedt (aurelien.goutsmedt[at]uclouvain.be)")
cat(session$robotstxt$text)
#Format is:
#       User-agent: <name of spider>
#       Disallow: <nothing> | <path>
#-------------------------------------------

User-Agent: *
Disallow: /dcms
Disallow: /metrics/
Disallow: /search/
Disallow: /staff.htm
Disallow: /embargo/
Disallow: /app/
Disallow: /goto.htm
Disallow: /login
#Disallow: /cbhub
Disallow: /cbhub/goto.htm
Disallow: /doclist/
# Committee comment letters
Disallow: /publ/bcbs*/
Disallow: /bcbs/ca/
Disallow: /bcbs/commentletters/
Disallow: /*/publ/comments/
# Hide the Basel Framework standards, only chapters should be indexed.
Disallow: /basel_framework/standard/

Sitemap: https://www.bis.org/sitemap.xml
session$robotstxt$sitemap
    field useragent                           value
1 Sitemap         * https://www.bis.org/sitemap.xml

Using sitemap

Code
# This function goes to a sitemap page and extracts all the URLs found there
extract_url_from_sitemap <- function(url, delay = 1) { 
  urls <- read_html(url) %>% 
    html_elements(xpath = ".//loc") %>% 
    html_text()
  Sys.sleep(delay) # You set a delay to avoid overloading the website
  return(urls)
}

# insistently() retries the call when loading the page fails
insistently_extract_url <- insistently(extract_url_from_sitemap, 
                                       rate = rate_backoff(max_times = 5)) 

document_pages <- extract_url_from_sitemap(session$robotstxt$sitemap$value) %>% 
  .[str_detect(., "documents")] # We keep only the URLs for documents

bis_pages <- map(document_pages[1:5], # showing the code just on the first five years
                 ~insistently_extract_url(url = ., 
                                          delay = session$delay))

bis_pages <- tibble(year = str_extract(document_pages[1:5], "\\d{4}"),
                    urls = bis_pages) %>% 
  unnest(urls)

Scraping a BIS speech with rvest

url_speech <- "https://www.bis.org/review/r241022f.htm"
page <- read_html(url_speech)
print(page)
{html_document}
<html class="no-js" lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
[1] <head>\n<meta content="IE=edge" http-equiv="X-UA-Compatible">\n<meta cont ...
[2] <body>\n<div class="dt tagwidth" id="body">\n<div id="bispage">\n<noscrip ...
page %>% 
  html_element("h1") %>%
  html_text
[1] "Joachim Nagel: Dot plots for the Eurosystem?"
page %>% 
  html_element("#extratitle-div p:nth-child(1)") %>% 
  html_text
[1] "Speech by Dr Joachim Nagel, President of the Deutsche Bundesbank, at Harvard University, Cambridge, 22 October 2024."
page %>% 
  html_elements(".Reden") %>% 
  html_text
[1] "Ladies and gentlemen,"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
[2] "it is a great pleasure to be at Harvard again, to meet long time companions like Hans-Helmut Kotz and to exchange ideas with top scientists such as Benjamin Friedman. When I was in this round two years ago, we were dealing with an unprecedented global inflation spike. Fortunately, the worst is behind us, and inflation in the euro area is heading back to the Eurosystem's target. We have not brought the inflation ship safely back into the 2% harbour, but the port is in sight. Thus, I can focus on another question today."                                                                                                                                                                
[3] "Before I do that, let me share an analogy to set the stage for my discussion. Back in the 1970s and 1980s, the field of economics was split into two seemingly incompatible schools of thought: New Keynesian and New Classical. Their proponents were not too polite in their language, calling assumptions \"foolishly restrictive\" or comparing an opponent to someone attempting to pass himself off as Napoleon Bonaparte. But, over time, ideas from both camps ultimately merged to form a consensus called the New Neoclassical Synthesis, the very foundation of modern macroeconomics. Gregory Mankiw neatly described this story in his essay \"The Macroeconomist as Scientist and Engineer\"."
[4] "The takeaway from this analogy is that complex issues are rarely black or white. With this in mind, I want to explore whether the conduct of monetary policy in the euro area could be enhanced by offering more detailed and nuanced information regarding its future outlook. More specifically, today I will address the following question: Should the Eurosystem introduce dot plots?"                                                                                                                                                                                                                                                                                                                 

Scraping BIS: understanding URLs

page <- 2
day <- "01"
month <- "10"
year <- 2024 # we want to look at all the speeches since October 1st 2024
url_second_page <- glue("https://www.bis.org/cbspeeches/index.htm?fromDate={day}%2F{month}%2F{year}&cbspeeches_page={page}&cbspeeches_page_length=25")

print(url_second_page)
https://www.bis.org/cbspeeches/index.htm?fromDate=01%2F10%2F2024&cbspeeches_page=2&cbspeeches_page_length=25

Scraping one page: using a scrape helper

  • Scraping add-ons for browsers help you navigate through the elements of a webpage
    • XPath is the path to a specific part of a webpage
    • CSS selectors are primarily for styling web pages, but they also allow matching the position of an element within the HTML structure
  • Typical scraping helpers: ScrapeMate and SelectorGadget
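Helpers like SelectorGadget typically propose either form; both target the same element, as this offline sketch shows (the HTML fragment, mimicking the structure of a BIS page, is invented):

```r
library(rvest)

# Toy fragment mimicking the structure a scrape helper would inspect (invented)
html <- minimal_html('<div id="extratitle-div"><p>First paragraph</p><p>Second paragraph</p></div>')

# CSS selector, as a helper such as SelectorGadget would suggest
html %>% html_element("#extratitle-div p:nth-child(1)") %>% html_text2()

# Equivalent XPath to the same element
html %>% html_element(xpath = '//div[@id="extratitle-div"]/p[1]') %>% html_text2()
```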

Scraping the query page: mixing rvest and RSelenium

# Launch Selenium to go on the website of bis
driver <- rsDriver(browser = "firefox", # can also be "chrome"
                   chromever = NULL,
                   port = 4444L) 
remote_driver <- driver[["client"]]

Scraping one page: mixing rvest and RSelenium

remote_driver$navigate(url_second_page)
Sys.sleep(session$delay)


element <- remote_driver$findElement("css selector", ".item_date")
element$getElementText()[[1]]
[1] "09 Oct 2024"


elements <- remote_driver$findElements("css selector", ".item_date")
length(elements)
[1] 25
elements[[25]]$getElementText()[[1]]
[1] "02 Oct 2024"

Scraping one page

Code
data_page <- tibble(date = remote_driver$findElements("css selector", ".item_date") %>% 
                      map_chr(., ~.$getElementText()[[1]]),
                    info = remote_driver$findElements("css selector", ".item_date+ td") %>% 
                      map_chr(., ~.$getElementText()[[1]]),
                    url = remote_driver$findElements("css selector", ".dark") %>% 
                      map_chr(., ~.$getElementAttribute("href")[[1]])) %>% 
  separate(info, c("title", "description", "speaker"), "\n")

Scraping all the pages

starting_url <- glue("https://www.bis.org/cbspeeches/index.htm?fromDate={day}%2F{month}%2F{year}&cbspeeches_page=1&cbspeeches_page_length=25")
remote_driver$navigate(starting_url)

# Extract the total number of pages
nb_pages <- remote_driver$findElement("css selector", ".pageof")$getElementText()[[1]] %>%
  str_remove_all("Page 1 of ") %>%
  as.integer()

# creating a list object to progressively store the information
metadata <- vector(mode = "list", length = nb_pages)

for(page in 1:nb_pages){
  url <- glue("https://www.bis.org/cbspeeches/index.htm?fromDate={day}%2F{month}%2F{year}&cbspeeches_page={page}&cbspeeches_page_length=25")
  remote_driver$navigate(url)
  nod <- nod(session, url) # politely introducing ourselves to the new page
  Sys.sleep(session$delay) # using the delay time set by polite

  metadata[[page]] <- tibble(date = remote_driver$findElements("css selector", ".item_date") %>% 
                            map_chr(., ~.$getElementText()[[1]]),
                          info = remote_driver$findElements("css selector", ".item_date+ td") %>% 
                            map_chr(., ~.$getElementText()[[1]]),
                          url = remote_driver$findElements("css selector", ".dark") %>% 
                            map_chr(., ~.$getElementAttribute("href")[[1]])) 
}

metadata <- bind_rows(metadata) %>% 
  separate(info, c("title", "description", "speaker"), "\n")
driver$server$stop() # we close the bot once we've finished

5 Exercises

Exercises

6 Resources

Useful Resources

References

Krotov, Vlad, Leigh Johnson, and Leiser Silva. 2020. “Tutorial: Legality and Ethics of Web Scraping.”